Embarking on my Machine Learning (ML) project I am delving into the world of Community Notes (CN) on X, formerly known as Twitter. I will use ML tools to find the best model to predict the number of ratings that each notes has received.
Community Notes, found on platforms like X (formerly Twitter), play a crucial role in combating misinformation and improving content moderation. They allow users to add context to posts, providing diverse viewpoints to counter potential biases. The algorithm guiding Community Notes emphasizes consensus, ensuring that it’s not just about majority agreement. This collaborative approach empowers users to contribute to a more informed online space and brings transparency to the fact-checking process. The significance of Community Notes lies in their ability to debunk misinformation, lessen the impact of misleading content, and encourage a collective effort toward promoting accuracy in digital conversations.
Community Notes stands out as an innovative platform where contributors collaboratively add context to potentially misleading posts, challenging conventional content moderation methods. The publication of a CN is driven not by a majority rule but by the agreement of contributors who have previously disagreed, creating a transparent, community-driven approach to combat misinformation.
This sounds like a great idea, but it has been proven sometimes good but insufficient2 or irrelevant3, and even susceptible to disinformation4, as you can see in more detail in the Wikipedia page dedicated to Community Notes (CN).
At the core of this exploration is the open-source algorithm powering CN, described as “insanely complicated.” This algorithm ensures that notes are rated by a diverse range of perspectives, incorporating an opinion classification based on contributors’ alignment with the left and right-wing political spectrum. It is only after people that previously disagree, agree on the helpfulness of a note that the note is posted. Therefore, the number of ratings that a note receives is very important to determine if the note is ever published, and how fast.
The project is centered around a vast data set comprising around 380,000 notes, each representing a collaborative effort to combat misinformation. Of particular significance is the attempt to predict the number of ratings received by each note, as this is a crucial determinant in deciding whether a note is published. This predictive aspect adds a layer of complexity to our analysis, aiming to uncover insights into the collaborative evaluation system and its impact on the publication of notes.
This notes can be related to any topic and even advertising. It is worth noting that the most rated notes was about a game.
The openness of the data invite scrutiny and analysis, fostering an environment where skepticism can be transformed into informed inquiry. Join me on this journey as we explore the intricacies of Community Notes.
The data from the notes and the ratings are open to anyone with an account on X. On the github page of CN you can also find the code and algorithm. Here are the sources:
Data from the project from X, Community Notes, can be found here.
And the code from Community Notes is in Github.
Since the data sets are very large, I save the final data set with the information I needed from each one. In this section I explain the original data sets and the creation of the final merged data used for the present project.
In this section, I present the code I used to create and mere the raw
data to create the final dataset. If you want to replicate the code,
just crate a folder named data on your working directory
and download all the data directly from X. This section could be skipped
and you can continue to the EDA where I work with the final merged data
set.
The raw data can be downloaded directly from X, and they provide a good description of all the variables in each data set. They update the data continuously. For this project all the data was downloaded December 3rd. I will provide a codebook of the final data set I created from the raw data as documentation.
You have to download the file: notes-00000.tsv
# packages
library(pacman)
p_load(tidyverse,
lubridate,
naniar,
janitor,
forcats)
rm(list = ls())
# load data ----
# all the data was downloaded on December 3rd 2023
# notes
notes <- read_tsv("data/notes-00000.tsv") %>% clean_names()
## Select the variables that will be used in the model from the notes dataset ----
notes_final <-
notes %>% select(
note_id,
tweet_id,
classification,
trustworthy_sources,
summary,
is_media_note,
created_at_millis
) %>%
mutate(created_at = as.POSIXct(created_at_millis, origin="1970-01-01")) %>%
mutate(w_day = wday(created_at, label = T),
hour = as_factor(hour(created_at)),
note_length = nchar(summary)) %>%
select(-c(created_at_millis,
summary))
You have to download the file: noteStatusHistory-00000.tsv
## status
status <- read_tsv("data/noteStatusHistory-00000.tsv") %>% clean_names()
# the observations in the dataset are almost unique
length(unique(status$note_id))
# however they are all rated as "NEED MORE RATINGS"
duplicated_notes_status <-
status %>% group_by(note_id) %>%
summarise(n_notes = n()) %>%
filter(n_notes>1) %>%
pull(note_id) %>%
format(scientific = F)
status %>% filter(note_id %in% duplicated_notes_status) %>% View()
## select variables from ratings at the notes level ----
rates_summarise <-
bind_rows(r0,
r1,
r2,
r3) %>%
group_by(note_id) %>%
summarise(
ratings = n(),
agreement_rate = sum(agree)/n(),
helpful_rate = sum(helpfulness_level =="HELPFUL",na.rm = T)/n(),
not_helpful_rate = sum(helpfulness_level =="NOT_HELPFUL",na.rm = T)/n(),
somewhat_helpful_rate = sum(helpfulness_level =="SOMEWHAT_HELPFUL",na.rm = T)/n()
)
You have to download the files: ratings-00000.tsv, ratings-00001.tsv, ratings-00002.tsv, ratings-00003.tsv.
## ratings
r0 <- read_tsv("data/ratings-00000.tsv") %>% clean_names()
r1 <- read_tsv("data/ratings-00001.tsv") %>% clean_names()
r2 <- read_tsv("data/ratings-00002.tsv") %>% clean_names()
r3 <- read_tsv("data/ratings-00003.tsv") %>% clean_names()
## select variables from ratings at the notes level ----
rates_summarise <-
bind_rows(r0,
r1,
r2,
r3) %>%
group_by(note_id) %>%
summarise(
ratings = n(),
agreement_rate = sum(agree)/n(),
helpful_rate = sum(helpfulness_level =="HELPFUL",na.rm = T)/n(),
not_helpful_rate = sum(helpfulness_level =="NOT_HELPFUL",na.rm = T)/n(),
somewhat_helpful_rate = sum(helpfulness_level =="SOMEWHAT_HELPFUL",na.rm = T)/n()
)
The previous data sets will be merged in a file named
notes_merged.RData. This data set contains all the
variables used in the analysis, and their description can be found in
the codebook.
# merge data ----
notes_merged <- left_join(notes_final, rates_summarise, by = join_by(note_id)) %>%
# some notes never received ratings, let's replace them with 0
replace_na(list(ratings = 0,
agreement_rate = 0,
helpful_rate = 0,
not_helpful_rate = 0,
somewhat_helpful_rate = 0))
notes_merged <-
left_join(x = notes_merged,
y = status %>%
# select onlye the non duplicated rows
filter(!(note_id %in% duplicated_notes_status)) %>%
# I only analyze the current status
select(note_id,current_status),
by = join_by(note_id))
save(notes_merged,file = "data/notes_merged.RData")
# Load Packages
library(pacman)
p_load(tidyverse,
tidymodels,
recipes,
kknn,
yardstick,
tune,
ggplot2,
ggthemes,
rsample,
parsnip,
workflows,
corrplot
)
load("../data/notes_merged.RData")
In the data set there is only 3 values with missing data. When I looked for these specific tweets there were nothing posted. I assume this were mistakes. Given the magnitude of the data set it won’t affect the analysis if we just remove them. The ID of the notes with missing data are:
# missing data ----
# there are 3 rows with missing values
missing_cell <- which(is.na(notes_merged), arr.ind = TRUE)
# These are the tweet ids
notes_merged[missing_cell[,1],] %>% select(tweet_id) %>% pull() %>% format(scientific = F) %>% unique()
## [1] "1370126844251435008" "1381787182394904576" "1370110240532930560"
## [4] "1731779816842477824" "1665427964278763520" "1731728969643241472"
## [7] "1731596148018721280" "1731774989118882304" "1731257785331892224"
## [10] "1731619302606782720" "1731512206179897600" "1731477518765408768"
## [13] "1731743813129928960" "1731728512585744896" "1731724799259382016"
## [16] "1731482008675635200" "1731748271649939456" "1731607105218494720"
## [19] "1731661743720427776" "1731373412637901056" "1731425602639479040"
## [22] "1731400644546974208" "1731331295878263040"
# There is no summary in this notes, probably this was a mistake
# There is nothing in the note 1370110240532930560 that had 8 ratings. The other two notes were never rated.
# I remove the missing values, given the content and the number of missing values this shouldn't be an issue
notes_merged <- notes_merged %>% drop_na()
Let’s first check the summary statistics of the number of ratings in each note:
# Analyzing the outcome variable ----
summary(notes_merged$ratings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.00 18.00 49.37 52.00 8069.00
There is a lot of variability, and it is clear that many notes receive a lot of attention. The median is far less than the mean.
Lets check this looking at the histograms. I separate the notes by the lowest 99% and the highest 1% by number of ratings. We can see that many notes are never rated, but many notes receive some level of attention. For the most popular ones, it is clear that one note creates a lot of distortion.
# 99% of the notes have less than 453 ratings
q_99 <- quantile(notes_merged$ratings,probs = 0.99)
notes_merged %>%
filter(ratings < q_99) %>%
ggplot() +
geom_histogram(aes(x= ratings)) +
labs(
title = "Histogram of the number of Ratings on each Note",
subtitle = "Percentile 99 of the Ratings",
x="Number of Ratings"
) +
theme_bw()
notes_merged %>%
filter(ratings >= q_99) %>%
ggplot() +
geom_histogram(aes(x= ratings)) +
labs(
title = "Histogram of the number of Ratings on each Note",
subtitle = "1% of Notes with more Ratings",
x="Number of Ratings"
) +
theme_bw()
The note with the most ratings is about advertising:
# The note with most ratings
notes_merged %>% filter(ratings>6000) %>% arrange(ratings) %>%
select(tweet_id) %>%
pull() %>%
format(scientific = F)
## [1] "1672908357081124864"
# Correlations ----
# Correlation matrix
notes_cor <- cor(notes_merged %>%
select(-ends_with("id")) %>%
select_if(is.numeric))
# Visualization of correlation matrix
notes_corrplot <- corrplot.mixed(notes_cor,
lower = 'shade', upper = 'pie', order = 'hclust',
addCoef.col = 1, number.cex = 0.7,
tl.pos = "lt"
)
The strongest correlations are between the length of the note and if the note has a trustworthy source (0.4484799), if the note has rated as helpful or not helpful (-0.5131066), the agreement rate with the time (in milliseconds) the note was created (-0.4068341), and the helpful level with the time the note was created (0.2279972). The correlations are in general very low between the numeric predictors.
The correlation of helpfulness with non helpfulness is expected since when user rate the only other option is “somewhat helpful”. This is the reason the correlation is not perfectly negative. It is interesting to notice that the agreement decreased with the time the note was created, meaning that with more time of this policy, people started to agreed less, which is expected if more people if more people is participating, however the correlation between the time of creation and number of ratings is low. It is interesting that the time of creation is at the same time increasing the both the helpful and non helpful rates. This might be related to the fact that people have become more certain in their ratings since less people is selecting the somewhat helpful option.
It seems that more ratings are associated with the note rated as helpful, which is expected from how the algorithm is described. Also, it is notes classified as not “helpful” receive more ratings. This is true for looking at the mean, meadian and 10th and 90th quantile.
rmarkdown::paged_table(
notes_merged %>% group_by(current_status) %>%
summarise(mean(ratings), median(ratings),
quantile(ratings,probs = 0.1),
quantile(ratings,probs = 0.9)
)
)
In terms of what the note says about the tweet, more notes say that the tweet is “MISINFORMED_OR_POTENTIALLY_MISLEADING”. This is specially marked for the notes rated as “HELPFUL” were virtually all the notes say the tweet was misinformed or potentially misleading.
rmarkdown::paged_table(
as.data.frame(
table(notes_merged$classification, notes_merged$current_status)) %>%
pivot_wider(names_from = Var1,values_from = Freq) %>%
rename("Current State" = Var2)
)
Finally, the number of ratings is very similar between the classification of notes.
rmarkdown::paged_table(
notes_merged %>% group_by(classification) %>%
summarise(mean(ratings), median(ratings),
quantile(ratings,probs = 0.1),
quantile(ratings,probs = 0.9)
)
)
rmarkdown::paged_table(
notes_merged
)
Now that I have the final data set notes_merged, I can
start the analysis. We have to import some packages including
tidymodels, yardstick and
tune.
# Load Packages
library(pacman)
p_load(tidyverse,
tidymodels,
recipes,
kknn,
yardstick,
tune,
ggplot2,
ggthemes,
rsample,
parsnip,
workflows,
vip
)
The data is very large, and the models were already taking several minutes to finish. Because of this, I decided to take a \(5\%\) random sample of the complete data. This sample data is still large, with more than \(17,000\) observations.
# To reproduce the results
set.seed(1984)
# Load data ----
load("data/notes_merged.RData")
# lets get a subset of the data
notes_merged <- notes_merged %>%
filter(created_at > "2022-11-25 15:30:30 UTC") %>%
slice_sample(prop = 0.05)
# Sample from the original data set
save(notes_merged, file = "data/notes_merged_sample.RData")
load("../data/notes_merged_sample.RData")
Now, let’s split the original data set to make the analysis. I
decided to use \(75\%\) of the data for
training, and the sampling is stratified at the outcome variable
ratings.
# Split data and cross validation ----
training_percentage <- 0.75
split_notes <- initial_split(notes_merged,
prop = training_percentage,
strata = ratings)
train_notes <- training(split_notes)
test_notes <- testing(split_notes)
The proportion of observations in the training set was 0.7498841, and 0.2501159 for the test set. These numbers are closed to the proportion stated.
In the context of Community Notes within machine learning, envisioning the data set as a collection of notes, the process of dividing this data into training, testing, and validation sets becomes analogous to strategizing how to understand and predict the behavior of future notes. The existing notes serve as a sample, providing insights into how contributors have added context to posts in the past. However, it’s imperative not to assume that the future usage of notes will precisely mirror historical patterns.
Much like a training set, a substantial portion of existing notes would be allocated to allow the model to learn patterns, relationships, and features inherent in the data. This phase involves understanding how contributors have historically interacted with posts, detecting common themes, and learning the dynamics of note creation. The testing set, representative of notes yet unseen by the model, acts as a simulated evaluation of the model’s ability to generalize its learning to new instances of notes. This evaluation is critical in anticipating how well the model would adapt to future notes scenarios.
To account for the unpredictability and potential evolution in how contributors may use CN in the future, a validation set becomes paramount. This set serves as a means of fine-tuning the model, preventing it from over fitting the historical data and ensuring that it doesn’t make assumptions based solely on past patterns. The aim is to create a model that is not only proficient in understanding the existing CN but is also equipped to adapt to unforeseen events and new patterns that may emerge in future note creation.
v_folds <- 10
notes_folds <- vfold_cv(train_notes, v = v_folds, strata = "ratings")
I also divide the training data from Community Notes into 10 folds to perform a K-Fold Cross Validation. This is an advanced technique employed to rigorously evaluate the performance of a machine learning model. The process commences by partitioning the dataset into ten subsets, ensuring an equitable distribution of the Community Notes data. Subsequently, the model undergoes ten iterations of training and testing, where each fold is sequentially designated as the testing set, and the remaining folds collectively form the training set.
During each iteration, the machine learning model is trained on the training set, enabling it to discern intricate patterns and nuances within the Community Notes data. The model’s proficiency is then rigorously tested on the designated testing set, elucidating its capacity to generalize across diverse subsets of the dataset. Performance metrics, encompassing accuracy, precision, and recall, are meticulously recorded for each iteration, furnishing a granular assessment of the model’s efficacy.
The ultimate evaluation derives from averaging the performance metrics across all ten iterations. This calculated average serves as a comprehensive measure of the model’s generalization prowess, providing a nuanced understanding of its robustness in handling the intricacies and variations inherent in the diverse landscape of Community Notes. In essence, the adoption of K-Fold Cross Validation with 10 folds at a graduate level ensures a thorough and reliable evaluation of the machine learning model’s competence in understanding and predicting patterns within the multifaceted Community Notes dataset.
To adjust the data I used all the predictors, except
for note_id, tweet_id,
created_at, and current_status. The first two
variables are IDs, and the later two shouldn’t have an effect. As
mentioned in the EDA section, created_at_millis has a small
correlation with the ratings, and this variable capture the effect of
time, so created_at is redundant. Other cyclical effects of
time are analyzed with the hour and w_day
variables. Finally, current_status is a consequence of the
number of ratings, and not the other way around. Therefore, is of little
interest as a predictor.
Dummy variables were created in the recipe for w_day,
hour, and classification. Finally, the
predictors were normalized and the constant withdrawn. It is worth
mentioning that the variable agreement_rate became constant
because frequently this variable was \(0\). Other sampling stratified by this
variable could have been used, but since the correlation with
ratings was minimal.
# Recipe ----
rec_reg <- recipe(ratings ~ .,
data = train_notes) %>%
step_rm(note_id, tweet_id,
created_at, current_status) %>%
step_dummy(w_day, hour, classification) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
I will evaluate five different models:
To calculate the models, let’s start with giving the specifications for each one, and continue creating the workflow for each model:
# Models ----
##linear model ----
linear_reg <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
# Polinomial Regression ----
poly_rec <- rec_reg %>%
step_poly(note_length, helpful_rate, not_helpful_rate,
degree = tune())
poly_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
# KNN model
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
## Elastic net ----
# Tuning penalty and mixture
en_mod <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
## Random forest ----
# Tuning mtry (number of predictors), trees, and min_n (number of minimum values in each node)
rf_mod <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Workflow ----
## Linear ----
lm_wkflow <- workflow() %>%
# add model
add_model(linear_reg) %>%
# add receipe
add_recipe(rec_reg)
## POLYNOMIAL REGRESSION ----
poly_wf <- workflow() %>%
add_model(poly_spec) %>%
add_recipe(poly_rec)
## KNN ----
knn_wkflow <- workflow() %>%
# add model
add_model(knn_mod) %>%
# add receipe
add_recipe(rec_reg)
## EN ----
en_wkflow <- workflow() %>%
# add model
add_model(en_mod) %>%
# add receipe
add_recipe(rec_reg)
## RF ----
rf_wkflow <- workflow() %>%
# add model
add_model(rf_mod) %>%
# add receipe
add_recipe(rec_reg)
All the model have parameters to be tune, except for the linear model. Let’s create a grid for the parameter to evaluate in each model:
# Grids for tuning parameters ----
# For the Linear model there is no parameter to tune, so I don't need a grid or tuning.
## POLYNOMIAL REGRESSION ----
degree_grid <- grid_regular(degree(range = c(1,5)), levels = 5)
## KNN ----
knn_grid <- grid_regular(neighbors(range = c(1,15)),
levels = 5)
## EN ----
en_grid <- grid_regular(penalty(range = c(-5, 5)),
mixture(range = c(0,1)),
levels = 10)
## RF ----
rf_grid <- grid_regular(mtry(range = c(1, 12)),
trees(range = c(20,60)),
min_n(range = c(5,20)),
levels = 8)
Now that the recipe, models, workflows, and grids are ready for each model, let’s tune them. Remember that the linear model has no parameters to evaluate, so it it is not tuned. Finally, I save all the results from this process that is the most time consuming.
# Tuning ----
## POLYNOMIAL REGRESSION ----
poly_tuned <- tune_grid(
poly_wf,
resamples = notes_folds,
grid = degree_grid
)
## KKN ----
knn_tune <- tune_grid(
knn_wkflow,
resamples = notes_folds,
grid = knn_grid
)
## EN ----
en_tune <- tune_grid(
en_wkflow,
resamples = notes_folds,
grid = en_grid
)
## RF ----
rf_tune <- tune_grid(
rf_wkflow,
resamples = notes_folds,
grid = rf_grid
)
# Save tuning results ----
## Polynomial Regression ----
write_rds(poly_tuned, file = "data/tuned_models/poly.rds")
## KNN ----
write_rds(knn_tune, file = "data/tuned_models/knn.rds")
## EN ----
write_rds(en_tune, file = "data/tuned_models/elastic.rds")
## RF ----
write_rds(rf_tune, file = "data/tuned_models/rf.rds")
For the present document, we used the data stored in this files instead of running all the analysis again when knitting the Rmd file. Tuning the parameters took several minutes even considering that this is only \(5\%\) of the available data.
# Load tuning results ----
## Polynomial Regression ----
poly_tune <- read_rds(file = "../data/tuned_models/poly.rds")
## KNN ----
knn_tune <- read_rds(file = "../data/tuned_models/knn.rds")
## EN ----
en_tune <- read_rds(file = "../data/tuned_models/elastic.rds")
## RF ----
rf_tune <- read_rds(file = "../data/tuned_models/rf.rds")
Now I can look at the autoplots generated by the models that I tuned. Visualizing the autopilots of tuned models in R is a crucial step in gaining deeper insights into the complex interactions and decision-making processes inherent in machine learning. As models are fine-tuned and optimized, their autopilots, representing the automated strategies for navigating the feature space, become intricate and multidimensional. Visualization serves as a powerful tool to unravel these complexities, offering a clear and intuitive representation of the model’s behavior. The visual exploration of autopilots becomes an invaluable asset in demystifying the inner workings of sophisticated models, fostering a more transparent and comprehensible machine learning process.
# Autoplots ----
autoplot(en_tune, metric = 'rmse')
autoplot(poly_tune, metric = 'rmse')
autoplot(rf_tune, metric = 'rmse')
Les’s say something about each one of the autoplots:
Elastic Net: The model is looking better for values slightly larger than 1 in the regularization amount, and from there the model starts performing really bad. It at this optimal point of the value of regularizartion that we observe that penalty values in the middle (0.333, and 0.444) perfom better.The penalty also has an effects for values less than 0.222, but for the other values the difference seems very small.
Polynomial Regression: It seem aparent that the model performs better with a higher degree. The laargest improvement is from 3th to 4th degree.
Random Tree: It is clear that increasing the number of trees improves the RMSE for all minimal node sizes. Also the number of predictor improves the RMSE, but the gains are small for more than 7 predictors.
Let’s calculate the RMSE for each one the five models.
# Compare models ----
## Linear ----
lm_fit <- fit_resamples(lm_wkflow, resamples = notes_folds)
lm_rmse <- collect_metrics(lm_fit) %>%
slice(1)
## Polynomial Regression ----
poly_rmse <- show_best(poly_tune,metric = "rmse",n = 1) %>%
select(mean) %>% pull()
## KKN ----
knn_rmse <- show_best(knn_tune,metric = "rmse",n = 1) %>%
select(mean) %>% pull()
## EN ----
en_rmse <- show_best(en_tune,metric = "rmse",n = 1) %>%
select(mean) %>% pull()
## RF ----
rf_rmse <- show_best(rf_tune,metric = "rmse",n = 1) %>%
select(mean) %>% pull()
Now, let’s see the results in a table:
# Creating a tibble of all the models and their RMSE
final_compare_tibble <- tibble(
Model = c(
"Linear Regression",
"K Nearest Neighbors",
"Elastic Net",
"Random Forest",
"Polynomial Regression"
),
RMSE = c(
lm_rmse$mean,
knn_rmse,
en_rmse,
rf_rmse,
poly_rmse
)) %>%
arrange(RMSE)
rmarkdown::paged_table(
final_compare_tibble
)
The best model was the random forest. The values that better fit the model were mtry=\(10\), trees=\(54\), min=\(20\). Then, I have to fit the best model to the training data to get the parameters of the random forest. Finally, I am ready to implement the best random forest model to the testing data.
# Best model ----
show_best(rf_tune, metric = 'rmse', n=1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 54 20 rmse standard 111. 10 7.26 Preprocessor1_Model5…
best_train <- select_best(rf_tune, metric = 'rmse')
## Fit to training data ----
final_workflow_train <- finalize_workflow(rf_wkflow, best_train)
final_fit_train <- fit(final_workflow_train, data = train_notes)
## Testing the model ----
# Creating the predicted vs. actual value tibble
notes_tibble <- predict(final_fit_train, new_data = test_notes %>% select(-ratings))
notes_tibble <- bind_cols(notes_tibble, test_notes %>% select(ratings))
To see how the model predicts the testing data we can plot the number of ratings original data and the prediction made with the estimated model. In the first of the following two graphs, we can see that the model is not adjusting very well, specially for the notes that are very popular and have a large number of ratings. In the second graph, I filtered the observation to see if the adjustment is better for notes with less number of ratings.
# Creating plot of predicted values vs. actual values
notes_tibble %>%
ggplot(aes(x = .pred, y = ratings)) +
geom_point(alpha = 0.4) +
geom_abline(lty = 2) +
theme_grey() +
coord_obs_pred() +
labs(title = "Predicted Values vs. Actual Values",
x = "Model Prediction")
# Let's focus on the notes with not that many ratings
notes_tibble %>%
filter(.pred < 500,
ratings < 500) %>%
ggplot(aes(x = .pred, y = ratings)) +
geom_point(alpha = 0.4) +
geom_abline(lty = 2) +
# xlim(0, 50) +
# ylim(0, 50) +
theme_grey() +
coord_obs_pred() +
labs(title = "Predicted Values vs. Actual Values for Notes with less than 500 Ratings",
x = "Model Prediction")
Finally, let’s evaluate what are the variables that are more useful
predicting the number or ratings in the notes. The most important
variables in the random forest model estimated were:
somewhat_helpful_rate, not_helpful_rate,
created_at_millis, helpful_rate, and
note_length.
## VIP ----
# Using the training fit to create the VIP because the model was not actually fit to the testing data
final_fit_train %>%
extract_fit_engine() %>%
vip(aesthetics = list(fill = "red3", color = "blue3"))
In conclusion, our machine learning exercise aimed at predicting the
number of note ratings yielded compelling results, with the random
forest model emerging as the most effective. The predictors that
demonstrated significant influence on the outcome variable included
somewhat_helpful_rate, not_helpful_rate,
created_at_millis, helpful_rate, and
note_length. These variables provided valuable insights
into the factors influencing the engagement with Community Notes.
However, it’s crucial to acknowledge the inherent bias in our outcome variable. Notably, certain notes received substantial attention, while the majority garnered no ratings. This imbalance may have introduced complexities in the adjustment process, as none of the analyzed variables could predict the disproportionate attention some notes received. This observation opens a promising avenue for further analysis, particularly delving into the content of the notes and the tweets they reference.
Moving forward, an exciting avenue for exploration involves investigating how the topics and tone within the text of Community Notes might predict extremes—notes receiving minimal attention versus those gaining substantial traction. This hypothesis offers an intriguing opportunity to deepen our understanding of the dynamics driving user engagement within the community. The anticipated results could not only contribute to the refinement of open-source and community fact-checking policies on the internet but also potentially inform the design of more effective strategies in fostering a reliable and informed online environment. This research lays the groundwork for future endeavors, underscoring the dynamic interplay between content, engagement, and the evolving landscape of online information dissemination.
By Community Notes - https://twitter.com/CommunityNotes/photo, Public Domain, https://commons.wikimedia.org/w/index.php?curid=141534850↩︎
https://www.lemonde.fr/en/pixels/article/2023/07/03/i-spent-one-week-as-an-arbiter-of-truth-on-twitter-s-community-notes-service_6042188_13.html↩︎
https://mashable.com/article/twitter-x-community-notes-misinformation-views-investigation↩︎
By Community Notes - https://github.com/twitter/communitynotes↩︎
By Twitter - Original publication: Screenshot from CommunityNotesContributorImmediate source: https://twitter.com/i/communitynotes, Fair use, https://en.wikipedia.org/w/index.php?curid=75348629↩︎